Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

add fusion summary for LGAT #830

Merged
merged 9 commits into from
Nov 12, 2020

Conversation

jharenza
Copy link
Collaborator

@jharenza jharenza commented Nov 6, 2020

Purpose/implementation Section

What scientific question is your analysis addressing?

Adding fusion summary for LGAT tumors in preparation for #793

What was your approach?

  1. Added Fusions of interest: (KIAA1549--BRAF, FGFR1--TACC1, and MYB--QKI) and Genes of interest with fusions:
    BRAF
    ALK
    ROS1
    NTRK1
    NTRK2
    NTRK3
    PDGFRA
    FGFR2
    FGFR1
    MYB
    MYBL1
    RAF1
  2. Added all 3' kinase fusions which are in-frame and retain the kinase domain
  3. For 5' kinase fusions, added those which are in-frame and retain the kinase domain
  4. For 5' kinase fusions that don't meet 3., check whether it has a reciprocal fusion and whether that reciprocal is in-frame and the kinase domain is retained. Then, add those 5' kinase fusions.

What GitHub issue does your pull request address?

#808

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Double check the logic

Is there anything that you want to discuss further?

I kept all of this analysis in the same Rmd as the other fusion summary analyses because I am using / modifying those functions. I almost think some of this could be its own notebook since it is unique to LGAT, but could go either way. Thoughts on this?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes

Results

What types of results are included (e.g., table, figure)?

table

What is your summary of the results?

Numbers of unique patient fusions:
Non-kinase fusions: 11
3' kinase fusions which are in-frame and retain the kinase domain: 167
5' kinase fusions, added those which are in-frame and retain the kinase domain: 1
5' kinase fusions that don't meet 3., has a reciprocal fusion in-frame and the kinase domain is retained: 12

Combined these into fusion_summary_lgat_foi.tsv

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

@cansavvy cansavvy requested review from cansavvy and removed request for jaclyn-taroni November 6, 2020 13:39
Copy link
Collaborator

@kgaonkar6 kgaonkar6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, just a few questions/ suggestions I had are commented below.

I wonder if the CI is errors out with the test data?

analyses/fusion-summary/01-fusion-summary.Rmd Outdated Show resolved Hide resolved
analyses/fusion-summary/01-fusion-summary.Rmd Show resolved Hide resolved
Copy link
Collaborator

@cansavvy cansavvy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I can tell, it looks like this implements what's suggested in #808. 👍

My suggestions are mainly about where we might be able to shorten this up and add clarity in certain parts.

analyses/fusion-summary/01-fusion-summary.Rmd Outdated Show resolved Hide resolved
Gene2B %in% genes)
}
return(df %>%
select(Sample, FusionName, Fusion_Type, Gene1A, Gene1B, Gene2A, Gene2B, Gene1A_anno, Gene1B_anno, LeftBreakpoint, RightBreakpoint, reciprocal_exists, DomainRetainedGene1A, DomainRetainedGene1B) %>%
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unclear to me what the pattern is for what columns are being selected here. Can you tell me what the particular pattern or strategy for these columns being selected is?

Maybe we can come up with a more succinct way of selecting these if there's a pattern.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this, since we are specifically looking for in-frame (Fusion_Type), kinase (Gene1A_anno, Gene1B_anno), whether a reciprocal exists (reciprocal_exists ), and fusions with kinase domains retained (DomainRetainedGene1A, DomainRetainedGene1B), these columns required for LGAT fusion compilation. I added the left and right breakpoint columns to make sure that I was seeing unique fusions, although the rows may be unique enough with the combination of columns above. Let me check on this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the number of unique fusions without breakpoints == 265 and with breakpoints == 300, but the results are identical. I will remove those two columns!

analyses/fusion-summary/01-fusion-summary.Rmd Show resolved Hide resolved
analyses/fusion-summary/01-fusion-summary.Rmd Outdated Show resolved Hide resolved
analyses/fusion-summary/01-fusion-summary.Rmd Outdated Show resolved Hide resolved
analyses/fusion-summary/01-fusion-summary.Rmd Outdated Show resolved Hide resolved
analyses/fusion-summary/01-fusion-summary.Rmd Outdated Show resolved Hide resolved
analyses/fusion-summary/01-fusion-summary.Rmd Outdated Show resolved Hide resolved
analyses/fusion-summary/01-fusion-summary.Rmd Outdated Show resolved Hide resolved
analyses/fusion-summary/01-fusion-summary.Rmd Show resolved Hide resolved
jharenza added 4 commits November 6, 2020 13:04
fix variable name subsetFuseLGAT
add comments from @cansavvy and @kgaonkar6 code review
consolidate filterFusion() and filterLGATFusion() functions
- select only BS_ID and fusion
- add distinct
@jharenza
Copy link
Collaborator Author

jharenza commented Nov 6, 2020

@cansavvy and @kgaonkar6 - just made all of the updates and requested re-review. I actually found a bug in the 5' in-frame fusion filtering and had missed 3 FGFR1--TACC1 fusions, so the new total for all LGAT fusions is 193.

@cansavvy thank you SO much for your helpful tips - this was my first tidyverse code :)

Copy link
Collaborator

@cansavvy cansavvy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jharenza, thanks for these changes, I think this is coming along nicely! I just have one more area that I think we should further evaluate as far as what the code is doing.

analyses/fusion-summary/01-fusion-summary.Rmd Outdated Show resolved Hide resolved
analyses/fusion-summary/01-fusion-summary.Rmd Outdated Show resolved Hide resolved
analyses/fusion-summary/01-fusion-summary.Rmd Outdated Show resolved Hide resolved
analyses/fusion-summary/01-fusion-summary.Rmd Outdated Show resolved Hide resolved
analyses/fusion-summary/01-fusion-summary.Rmd Show resolved Hide resolved
analyses/fusion-summary/01-fusion-summary.Rmd Outdated Show resolved Hide resolved
analyses/fusion-summary/01-fusion-summary.Rmd Outdated Show resolved Hide resolved

Next, filter all fusions for 5' kinase fusions which have lost the kinase domain for reciprocal fusions which have a kinase domain intact and are in-frame.
Keep these for the final output file.
```{r}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking this chunk can still be simplified. In particular, I'm noticing that five_prime_domain_lost is from allFuseLGAT and five_prime_reciprocals is from five_prime_domain_lost but then we are adding back info from allFuseLGAT.

So basically both five_prime_domain_lost and five_prime_reciprocals are based on allFuseLGAT but we have extra manipulation steps. I think we can do this more directly.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I'm trying to make this shorter, I'm a bit confused on these steps.

I tried to boil it down to what looks like the essential steps to get five_prime_kinase_keep, but as of now it does not get the same result. I think part of this is I still don't really understand all that is happening in this chunk. Can you take a look at my attempt as well evaluating your original code to see how we can make these steps more efficient and clear?

In particular I know the differences have to do with the paired up "select() then distinct()" steps that are done a few timees, but its unclear to me why each of these "select() then distinct()" steps are necessary.

five_prime_kinase_keep <- allFuseLGAT %>%
  # Make reciprocal variable
  mutate(reciprocal = paste(Gene1B, Gene1A, sep ="--")) %>%
  filter(# Keep the in-frame reciprocals which have the kinase domain in tact.
        grepl("Kinase", Gene1A_anno) & reciprocal_exists == "TRUE" & DomainRetainedGene1A == "No" |
        # Also keep select the in-frame reciprocals which have the kinase domain in tact.
        Fusion_Type == "in-frame" & DomainRetainedGene1B == "Yes") %>% 
  # Select to which variables we want here
  select(Sample, FusionName = reciprocal) %>% 
  distinct()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh

        grepl("Kinase", Gene1A_anno) & reciprocal_exists == "TRUE" & DomainRetainedGene1A == "No" |
        # Also keep select the in-frame reciprocals which have the kinase domain in tact.
        Fusion_Type == "in-frame" & DomainRetainedGene1B == "Yes")

This filtering you have here is filtering on the frame-ness/domain retention of the 5' kinase fusions and not their reciprocals (3' kinase fusions), which is why I was making that variable, then re-joining five_prime_reciprocals with the allFuseLGAT.

What I am trying to do is find any 5' kinases without the domain retained, but then go back to that original dataframe allFuseLGAT and collect the annotations of their reciprocal fusions (ie 3' kinase fusions' frameness and domain retention), assess whether those are in-frame and retain the kinase domain, then if there is a reciprocal that satisfies that, join both the 5' kinase fusion dataframe (domain lost) and 3' kinase fusion dataframe (inframe/domain retained).

Actually, I can remove:

# Add back the reciprocal of the reciprocal (the 5' kinase fusion) and keep this list
five_prime_kinase_keep <- five_prime_reciprocals %>%
  mutate(reciprocal = paste(Gene1B, Gene1A, sep ="--")) %>%
  select(Sample, FusionName = reciprocal) %>%
  distinct()

and use:

# Then, select the in-frame reciprocals which have the kinase domain in tact. Retain 5' kinase fusion information and update 3' fusion column name to FusionName for merging with allFuseLGAT.
five_prime_kinase_keep <- five_prime_domain_lost %>% 
  select(Sample, five_prime_kinase = FusionName, FusionName = three_prime_kinase) %>%
  left_join(allFuseLGAT, by = c("Sample", "FusionName")) %>%
  filter(Fusion_Type == "in-frame" & DomainRetainedGene1B == "Yes") %>%
  select(Sample, FusionName = five_prime_kinase) %>%
  distinct()

distinct() is needed because in some cases, there are multiple breakpoints for each fusion and sometimes multiple can have the same frame-ness, so this reduces them to one fusion call per sample.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated with commit - @cansavvy let me know if this clarifies!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay! Thanks for those updates! I think this looks good!

update code from @cansavvy code review
Copy link
Collaborator

@cansavvy cansavvy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me! Thanks for the clarification!


Next, filter all fusions for 5' kinase fusions which have lost the kinase domain for reciprocal fusions which have a kinase domain intact and are in-frame.
Keep these for the final output file.
```{r}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay! Thanks for those updates! I think this looks good!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants